ABSTRACT

Observational ratings of student clinical performance
are influenced by factors other than the quality of the performance.
Individual raters may be more stringent or lenient than their
colleagues. In this medical school setting, multiple raters evaluated
each student. To reduce the influence of "error" due to differences
among raters, each rater was assigned a handicap score which was
calculated in three steps: (1) identify the cohort of students
observed by the rater, (2) calculate the mean of all faculty ratings
for that cohort (grand mean) and the mean given those students by the
rater, and (3) subtract the individual rater mean from the grand
mean. Analysis of the "original" and "adjusted" ratings for two
academic years indicated no differences in overall mean and standard
deviation. Generalizability analysis indicated an improvement
equivalent to increasing the number of raters per student by 50
percent (i.e., the variance component due to error was reduced by
about 33 percent). (Author)
Introduction

Observational ratings are a widely used method for assessing student clinical
performance in health science education. Common measurement errors associated with
rating forms include errors of leniency and central tendency, halo effect, logical
error, proximity error and contrast error (DeMers, 1978). Wherry (1952) discusses
rating errors by using an equation to picture the complexity of the rater's task.
The recorded rating score can be represented by the following equation:

    $RS = (A + e_b) + (E + e_e) + (B + e_p) + e_r$

In this equation, RS is the recorded score, A is the ability of the student, E
is environmental influence, B is the bias of the rater, and $e_b$, $e_e$, $e_p$ and
$e_r$ represent errors due to atypical behavior of the student, unexpected changes
in the environment, aberrant perceptions by the rater and random fluctuations
respectively. In classical test theory terminology, A is equivalent to a true score.
E represents the influence of environmental factors such as the format of the rating
form, training and motivation levels of the raters, and the performance situations
in which students are observed. B represents bias due to the idiosyncrasies of an
individual rater. This study reports a procedure for adjusting recorded scores to
reduce the influence of rater bias. If the numerical size of $(B + e_p)$ is reduced
in the equation above, RS will be a more accurate estimate of the student's ability
level (A).

Nunnally (1978) points out that raters differ in leniency, the tendency to say
good or bad things about people in general. In an educational context, students
would describe raters with leniency errors as "tough" or "easy" graders.
Littlefield, et al. (1981a) demonstrated that differences in rater leniency of
medical faculty were constant over a five year period despite annual comparative
feedback to the faculty. Cason and Cason (1981) propose a construct called Rater
Reference Point to account for individual differences in rater leniency. The Cason
model uses latent trait theory to estimate each rater's reference point (i.e.,
leniency error). This report proposes a similar adjustment to ratings by individual
faculty raters; however, instead of latent trait theory, individual faculty are
assigned a "handicap" score based upon the mean of all of the various faculty
ratings given to the students observed by the individual rater.

Method 

The subjects in this study are 203 medical faculty and residents who rated at
least five junior medical students during a 3 1/2 week Internal Medicine Clerkship
in academic years 1981 and 1982. The requirement to have rated at least five
students was arbitrarily imposed to ensure that each rater had performed sufficient
ratings to establish a "stable mean." The design of the rating form and the role
of attending faculty have been described previously (Littlefield, et al., 1981b).
Performance was rated on each of five items on a 0-to-14 point numerical scale.

A total of 355 medical students were each rated by 5 to 9 raters during academic
years 1981 and 1982. A "handicap" score was calculated for each "subject" rater
in three steps: (1) identify the cohort of students rated by the individual faculty
member during the academic year (range = 5 to 49); (2) calculate the mean of all
faculty ratings for that cohort (grand mean) and the mean rating given those
students by the individual faculty member; and (3) subtract the individual rater
mean from the grand mean ($h = \bar{X} - \bar{x}$, grand mean minus rater mean). If
a rater received a positive "handicap" score, his/her mean score was lower than the
grand mean for all raters who observed that cohort of students. All individual
faculty ratings were "adjusted" by adding the handicap score to the original
ratings. The result was two sets of ratings for each student, original and adjusted.

The data sets were edited using two criteria: (1) eliminate student records which
do not have at least four ratings by "subject" raters and (2) randomly delete
ratings from student records with more than four. The final result was two 4 x 162
matrices (adjusted and unadjusted student ratings) for 1981 and two 4 x 144 matrices
for 1982.

Generalizability analyses (Brennan and Kane, 1977) were performed on the original
and the adjusted ratings. This analysis uses an analogy to communications systems
to assess the precision of the scores. The variance component due to differences
between students (the signal) is compared to the variance component due to
differences among raters of the same student (noise). Variance components are
statistical estimates of the hypothesized components of an observed score
(Cronbach, et al., 1972). The numerical size of the variance component due to
differences between students is directly related to the standard deviation of the
mean rating given to each student. The numerical size of the variance component due
to differences between raters of the same student is directly related to how
closely the four raters agree. The BMDP-8V program (Dixon and Brown, 1979) was
used to compute variance components.
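
The three-step handicap calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure as actually programmed for the study: the function names (`handicap_scores`, `adjust_ratings`), the tuple layout of the input, and the treatment of each rating as a single number per rater-student pair are all assumptions made for the example. It also assumes that the grand mean includes the rater's own ratings, which is one reading of "the mean of all faculty ratings for that cohort."

```python
from collections import defaultdict
from statistics import mean


def handicap_scores(ratings, min_students=5):
    """Return {rater: handicap} from (rater, student, score) tuples.

    Hypothetical helper. Handicap = grand mean of all ratings given to the
    rater's cohort of students minus the rater's own mean for that cohort.
    Raters with fewer than `min_students` students are skipped.
    """
    by_student = defaultdict(list)   # every score each student received
    by_rater = defaultdict(list)     # (student, score) pairs per rater
    for rater, student, score in ratings:
        by_student[student].append(score)
        by_rater[rater].append((student, score))

    handicaps = {}
    for rater, rated in by_rater.items():
        # Step 1: identify the cohort of students observed by this rater.
        cohort = {student for student, _ in rated}
        if len(cohort) < min_students:
            continue
        # Step 2: grand mean over all ratings of the cohort (assumed to
        # include this rater's own ratings) and the rater's own mean.
        grand_mean = mean(s for student in cohort for s in by_student[student])
        rater_mean = mean(score for _, score in rated)
        # Step 3: subtract the individual rater mean from the grand mean.
        handicaps[rater] = grand_mean - rater_mean
    return handicaps


def adjust_ratings(ratings, handicaps):
    """Add each rater's handicap to that rater's original ratings."""
    return [
        (rater, student, score + handicaps[rater])
        for rater, student, score in ratings
        if rater in handicaps
    ]
```

Under these assumptions a stringent rater (own mean below the cohort's grand mean) receives a positive handicap, so the adjusted ratings move upward toward the grand mean, as the text describes.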

Results 

Table 1 reports the overall mean ratings, standard deviations and range for
the original and adjusted ratings in academic years 1981 and 1982. It appears that
the adjustments did not substantially change the overall leniency of the ratings or
the "spread" among individual student scores. Table 2 presents a frequency
distribution of the number of faculty with various levels of "handicap" scores. A
Kolmogorov-Smirnov test indicates that the handicap scores in 1981 and 1982
approximate a normal distribution. Like many human traits, a few individuals
apparently have rather extreme positive or negative leniency error, but most raters
are near zero. The overall means of the handicap scores are 0.04 in 1981 and 1982
with standard deviations of .851 and .843 respectively. Table 3 is an analysis of
variance summary table for the original and adjusted ratings. Notice in the adjusted
ratings that the sums of squares due to differences between raters of the same
student decrease substantially from the original ratings. This would be expected
since the handicap score adjusts each individual faculty member's ratings toward the
grand mean for the cohort of students rated.

Table 4 presents the intraclass correlation coefficients for the original and
adjusted ratings. These coefficients summarize the ability of the ratings to
separate the "signal," in this case the differences (variance) among students, from
the "noise." The coefficients can vary from 0.0 to 1.0. The 1981 original ratings
coefficient is in the same range as those reported by Littlefield, et al. (1981b)
when adjusted to reflect four raters. The 1982 original ratings coefficient is higher
due to a larger signal (variance component due to differences between students). The
adjusted ratings coefficients are larger than the original ratings coefficients due to
an increase in the strength of the "signal" and a decrease in the "noise." The decrease
in "noise" reflects the reduced mean square due to differences between raters of
the same student. The increased "signal" strength is also related to reduced noise
since it is calculated by the expected mean square (EMS) equation:
$EMS = 4\sigma^2_{st} + \sigma^2_{r(st)}$. In this equation EMS is set equal to the
mean square due to differences between students. The variance component due to
differences between raters of the same student, $\sigma^2_{r(st)}$, is set equal to
its analogous mean square. With estimates of EMS and $\sigma^2_{r(st)}$, the
equation can then be solved to find $\sigma^2_{st}$. Table 5 demonstrates the
algebraic manipulation. Table 6 demonstrates the change of individual student scores
for the 1981 academic year.
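
As a check on the arithmetic, the variance components and intraclass coefficients can be recomputed from the mean squares reported in Table 3. The sketch below is illustrative only; the function names are hypothetical, and the design is assumed to be the four-raters-per-student layout described in the Method section. The last helper expresses the abstract's equivalence between reducing error variance and adding raters, under the added assumption that the signal component itself does not change.

```python
def student_variance_component(ms_students, ms_raters, n_raters=4):
    """Solve EMS = n * var_students + var_raters for var_students."""
    return (ms_students - ms_raters) / n_raters


def intraclass_correlation(ms_students, ms_raters, n_raters=4):
    """signal / (signal + noise), with noise = rater variance / n_raters."""
    signal = student_variance_component(ms_students, ms_raters, n_raters)
    noise = ms_raters / n_raters
    return signal / (signal + noise)


def equivalent_raters(n_raters, error_reduction):
    """Number of raters giving the same noise term as cutting the error
    variance by the fraction `error_reduction` (signal assumed unchanged)."""
    return n_raters / (1.0 - error_reduction)


# Mean squares (students, raters within students) copied from Table 3.
TABLE_3_MEAN_SQUARES = {
    ("1981", "adjusted"): (5.36, 1.69),
    ("1981", "original"): (5.58, 2.86),
    ("1982", "adjusted"): (6.40, 1.64),
    ("1982", "original"): (7.04, 2.54),
}

for (year, kind), (ms_s, ms_r) in TABLE_3_MEAN_SQUARES.items():
    component = student_variance_component(ms_s, ms_r)
    coefficient = intraclass_correlation(ms_s, ms_r)
    print(f"{year} {kind}: component={component:.3f} coefficient={coefficient:.3f}")
```

Computed this way, the results agree with Tables 3 and 4 to within rounding of the published mean squares. A one-third reduction in error variance with four raters corresponds to `equivalent_raters(4, 1/3)`, that is, about six raters, or 50 percent more.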

Discussion and Conclusions 

This study has implications in two areas: making decisions about students based
upon observational ratings and improving the precision of ratings. A large clinical
department utilizes many faculty raters and they are likely to differ in leniency
error. With relatively random assignment of students to raters, some students will
be assigned entirely to stringent raters and their mean rating score in this system
(0-14 scale) will be one to two points lower than their performance level justified.
The seriousness of this problem depends on the types of decisions to be made. In
this particular system, it might result in the change of a letter grade, but not in
outright failure because failure decisions are reviewed individually by the
clerkship director. Table 6 shows that only 55% of the students in 1981 would remain
in the same decile as their unadjusted mean rating score. The changes in decile for
students in 1982 were not calculated; however, they would be less pronounced because
the variance component due to students (signal) is much larger, indicating that the
scores are more spread out.
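
A decile comparison of the kind summarized in Table 6 could be tabulated along the following lines. The paper does not say how deciles were assigned or how ties were broken, so the rank-based rule used here, and the function names, are assumptions made purely for illustration.

```python
def decile_ranks(mean_scores):
    """Map each student to a decile 0..9 (0 = lowest) by rank of mean score.

    Assumed rule: students are sorted by mean score and split into ten
    equal-sized rank bands; ties keep their input order.
    """
    ordered = sorted(mean_scores, key=mean_scores.get)
    n = len(ordered)
    return {student: (rank * 10) // n for rank, student in enumerate(ordered)}


def decile_shifts(original_means, adjusted_means):
    """Return {student: adjusted decile - original decile}."""
    before = decile_ranks(original_means)
    after = decile_ranks(adjusted_means)
    return {student: after[student] - before[student] for student in original_means}


def share_unchanged(shifts):
    """Fraction of students whose decile did not change."""
    return sum(1 for shift in shifts.values() if shift == 0) / len(shifts)
```

Each argument is a dictionary mapping a student identifier to that student's mean rating; a positive shift means the student moved up after adjustment.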

Adjusting rating scores is an inexpensive method of improving the precision of
rating systems. Landy and Farr (1980), in a review of the research on performance
ratings, note that training raters will reduce rating errors if the training is
sufficiently extensive. In this rating system, over 200 raters observe the students;
therefore, the logistics of training raters are formidable if not prohibitively
difficult. By contrast, the use of "adjusted scores" represents an improvement in
precision of the scores with no additional costs. The degree of improvement will
depend on the relative strength of the "signal" and "noise" variance components.
The improvement in the 1982 intraclass correlation coefficient was less striking
than in 1981. From an organizational development perspective, it would be critically
important to involve the raters in the decision to adopt adjusted scores. The
validity of the ratings depends upon the conscientious efforts of the raters and
the validity of the whole process would suffer immensely if they are trying to
"beat the system."

This study has demonstrated a method for estimating the effects of rater bias
on recorded scores as outlined by the equation:

    $RS = (A + e_b) + (E + e_e) + (B + e_p) + e_r$

The handicap scores are a composite estimate of $B + e_p$. Each rater submitted only
one rating per student; therefore, random errors of perception cannot be separated
from the effects of overall bias (B). The findings of the study must be qualified
by noting that the requirement that subject raters have completed at least five
ratings resulted in deleting about 50% of the raters from each academic year. Five
ratings established a "stable mean" for each rater from which his/her handicap
score could be calculated. It seems likely that raters who complete fewer than five
ratings annually are more susceptible to leniency error than their colleagues who
rate larger numbers of students. Landy and Farr (1980) emphasize the need to learn
more about the way raters observe, encode, store, retrieve and record performance
information. With that research, perhaps answers will come to questions such as the
accuracy of ratings by "occasional" raters.
TABLE 1

Descriptive Statistics for Original and Adjusted Ratings

                   1981                            1982
          Original       Adjusted         Original       Adjusted
Mean      9.17           9.26             9.29           9.33
S.D.      1.18           1.16             1.91           [illegible]
Range     [illegible]    5.97-12.11       3.0-14.0       4.13-14.87


TABLE 2

Frequency Distribution of Handicap Scores

              1981                                  1982
Score Range        No. of Faculty      Score Range        No. of Faculty
                   Raters                                 Raters
-1.99 to -1.50      3                  -2.41 to -1.50      5
-1.49 to -1.00     14                  -1.49 to -1.00      7
 -.99 to  -.50     13                   -.99 to  -.50     13
 -.49 to  0.00     13                   -.49 to  0.00     22
  .01 to   .49     25                    .01 to   .49     25
  .50 to   .99     21                    .50 to   .99     18
 1.00 to  1.49     [illegible]          1.00 to  1.49     10
 1.50 to  2.01      3                   1.50 to  2.01      3
TABLE 3

ANOVA Summary Tables & Variance Components for Original and Adjusted Ratings

Year  Ratings   Source                   Sum of Squares   D.F.   Mean Square   Variance Component
1981  Adjusted  Students                   862.76         161    5.36           .92
                Difference Bet. Raters     821.50         486    1.69          1.69
1981  Original  Students                   898.76         161    5.58           .68
                Difference Bet. Raters    1392.01         486    2.86          2.86
1982  Adjusted  Students                   915.63         143    6.40          1.19
                Difference Bet. Raters     708.25         432    1.64          1.64
1982  Original  Students                  1007.2          143    7.04          1.13
                Difference Bet. Raters    1096.2          432    2.54          2.54
TABLE 4

Intraclass Correlation Coefficient for Original and Adjusted Ratings

Conceptual model: $\rho = \frac{\text{signal}}{\text{signal} + \text{noise}}$

Year  Ratings   Computation                   Coefficient
1981  Adjusted  .92 / (.92 + 1.69/4)          .69
1981  Original  .68 / (.68 + 2.86/4)          .49
1982  Adjusted  1.19 / (1.19 + 1.64/4)        .74
1982  Original  1.13 / (1.13 + 2.54/4)        .64


TABLE 5

Calculating the Variance Component
Due to Student Differences

$EMS = 4\sigma^2_{st} + \sigma^2_{r(st)}$

$MS_{st} = 4\sigma^2_{st} + MS_{r(st)}$

$\sigma^2_{st} = \frac{MS_{st} - MS_{r(st)}}{4}$


TABLE 6

Impact on Decisions about Students in 1981

Class Quartile Changes (N = 162)

Down           No Change      Up
[illegible]    83%            8%

Class Decile Changes (N = 162)

Down 2         Down 1         No Change      Up 1           Up 2
[illegible]    22%            55%            19%            2%



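EFFECT OF OPEN AND CLOSED QUESTIONS ON PARTICIPATION

[Figure: participation by question type. Recoverable labels: Closed, Open, Memory,
Analysis. Recoverable values: 1.50, 4.90, 4.00. Source: Teaching Development
Newsletter; author and year illegible.]
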

Types of Lead-off Questions

TYPE                                  RESPONSE
Quiz Show                             1.50
Fishing                               2.00
Shotgun                               2.50
Metaphysical (General Invitation)     2.50
Structured Open                       5.00
CLOSED QUESTIONS

characteristics:

• predictable answers

• tests memory of student

• often yes/no, or one word answer

• will not stimulate discussion
ADJUSTING OBSERVATIONAL RATINGS TO IMPROVE INTER-RATER [illegible]


